library(tidyverse)
Error in exists(cacheKey, where = .rs.WorkingDataEnv, inherits = FALSE) : 
  invalid first argument
library(janitor)
Error in exists(cacheKey, where = .rs.WorkingDataEnv, inherits = FALSE) : 
  invalid first argument
steam <- read_csv("raw_data/steam_checkpoint.csv")
Rows: 26564 Columns: 22── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): name, developer, publisher, multiplayer, categories, genres, steamspy_tags, owners
dbl  (8): appid, required_age, achievements, positive_ratings, negative_ratings, average_playtime, median_playtime, price
lgl  (5): free_to_play, virtual_reality_support, windows_support, mac_support, linux_support
date (1): release_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.Error in exists(cacheKey, where = .rs.WorkingDataEnv, inherits = FALSE) : 
  invalid first argument

Where was i?

games_pre_cleaned <- read_csv("raw_data/Cleaned Data 2 GVGS&R.csv")
Rows: 6894 Columns: 15── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Name, Genre, Publisher, Developer, Rating
dbl (10): Year_of_Release, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Score, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.Error in exists(cacheKey, where = .rs.WorkingDataEnv, inherits = FALSE) : 
  invalid first argument
Error in assign(cacheKey, frame, .rs.CachedDataEnv) : 
  attempt to use zero-length variable name
games_pre_cleaned

How dare you do the work for me

games_raw <- read_csv("raw_data/Raw Data GVGS&R.csv")
Rows: 16719 Columns: 16── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Name, Platform, Year_of_Release, Genre, Publisher, User_Score, Developer, Rating
dbl (8): NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, Critic_Score, Critic_Count, User_Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games_raw %>% 
  filter(Name == "Minecraft")

Why does everyone have the same dataset

back to steam i guess

steam
steam %>% 
  distinct(genres)

Already have F2P covered by another column, so we can drop that in genres

steam %>% 
  mutate(genres = str_remove_all(genres, "Free to Play")) %>% 
  filter(free_to_play == TRUE)

Is this worth doing? Duplicated info doesn’t hurt i suppose - leave it for now

single player?

steam <- steam %>% 
  mutate(singleplayer = case_when(
    str_detect(categories, "Single-player") ~ TRUE,
    str_detect(steamspy_tags, "Singleplayer") ~ TRUE,
    TRUE ~ FALSE
  ),.before = multiplayer)
steam %>% 
  filter(free_to_play == TRUE) 
backloggd <- read_csv("clean_data/backloggd_clean.csv")
Rows: 1115 Columns: 15── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): title, team, genre_tag, genre_tag_2, genre_tag_3, genre_tag_4, genre_tag_5, genre_tag_6, genre_tag_7
dbl  (5): years_since_release, rating, number_of_reviews, wishlist, plays
date (1): release_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
backloggd
games_raw <- games_raw %>% 
  clean_names()
# Platform isn't a genre - correcting
games_raw <- games_raw %>% 
  mutate(genre = case_when(
    genre == "Platform" ~ "Platformer",
    TRUE ~ genre
  ))
nintendo_published <- games_raw %>% 
  filter(publisher == "Nintendo")
genre_publisher_sales <- games_raw %>%  
  group_by(publisher, genre) %>% 
  summarise(sum(global_sales)) %>% 
  arrange(desc(`sum(global_sales)`))
`summarise()` has grouped output by 'publisher'. You can override using the `.groups` argument.
write_csv(genre_publisher_sales, "clean_data/genre_by_publisher.csv")
# Adding PEGI equivalent 
games_raw <- games_raw %>% 
  mutate(pegi_equivalent = case_when(
    esrp == "EC" ~ "3",
    esrp == "E" ~ "7",
    esrp == "E10+" ~ "12",
    esrp == "T" ~ "16",
    esrp == "M" ~ "18",
    esrp == "AO" ~ "18",
    esrp == "K-A" ~ "7",
    esrp == "RP" ~ "No Rating",
    TRUE ~ "No Rating Found"
  )) 
  

how does the sales for the top 3 genres look in each region?

What about trends? How did each genres sales adjust every 5 or so years?

sales_2019 <- read_csv("raw_data/sales-2019.csv")
Rows: 55792 Columns: 23── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, basename, Genre, ESRB_Rating, Platform, Publisher, Developer, Last_Update, url, img_url
dbl (12): Rank, Critic_Score, User_Score, Total_Shipped, Global_Sales, NA_Sales, PAL_Sales, JP_Sales, Other_Sales, Year, status, Vgchartzscore
lgl  (1): VGChartz_Score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Going off the fact that there is the majority of usable information i am able to get is from 2019 or before, we’re just going to pretend we’re living in an

alternative 2019

As such, i am dropping all data from 2020 in this dataset

Looking at backloggd again

backloggd %>% 
  mutate(rating_range = case_when(
    rating < 5 & rating >= 4 ~ "4+",
    rating < 4 & rating >= 3 ~ "3 to 4",
    rating < 3 & rating >= 2 ~ "2 to 3",
    rating < 2 & rating >= 1 ~ "1 to 2",
    rating < 1 ~ ">1",
    TRUE ~ "No rating"
  ))
---
title: "DAY 2"
output: html_notebook
---

```{r}
library(tidyverse)
library(janitor)

steam <- read_csv("raw_data/steam_checkpoint.csv")
```

Where was i?

```{r}
games_pre_cleaned <- read_csv("raw_data/Cleaned Data 2 GVGS&R.csv")
```

```{r}
games_pre_cleaned
```
How dare you do the work for me

```{r}
games_raw <- read_csv("raw_data/Raw Data GVGS&R.csv")
```

```{r}
games_raw %>% 
  filter(Name == "Minecraft")
```
Why does everyone have the same dataset

back to steam i guess

```{r}
steam
```

```{r}
steam %>% 
  distinct(genres)
```

Already have F2P covered by another column, so we can drop that in genres

```{r}
steam %>% 
  mutate(genres = str_remove_all(genres, "Free to Play")) %>% 
  filter(free_to_play == TRUE)
```

Is this worth doing?
Duplicated info doesn't hurt i suppose - leave it for now

single player?

```{r}
steam <- steam %>% 
  mutate(singleplayer = case_when(
    str_detect(categories, "Single-player") ~ TRUE,
    str_detect(steamspy_tags, "Singleplayer") ~ TRUE,
    TRUE ~ FALSE
  ),.before = multiplayer)
```

```{r}
steam %>% 
  filter(free_to_play == TRUE) 
```
```{r}
steam
```
```{r}
# adding column to simplify ratings 
steam <- steam %>% 
  mutate(total_reviews = positive_ratings +  negative_ratings) %>% 
  arrange(desc(total_reviews)) %>% 
  mutate(percent_positive_reviews = positive_ratings / total_reviews) %>% 
  mutate(general_rating = case_when(
    total_reviews > 1000 & percent_positive_reviews >= 0.95 ~ "Extremely Positive",
    total_reviews > 1000 & percent_positive_reviews >= 0.75 ~ "Positive",
    total_reviews > 1000 & percent_positive_reviews >= 0.55 ~ "Mostly Positive",
    total_reviews > 1000 & percent_positive_reviews >= 0.50 ~ "Mixed" ,
    total_reviews > 1000 & percent_positive_reviews >= 0.35 ~ " Mostly Negative",
    total_reviews > 1000 & percent_positive_reviews >= 0.15 ~ "Negative",
    total_reviews > 1000 & percent_positive_reviews < 0.15 ~ "Extremely Negative",
    positive_ratings == 0 ~ "No Positive Reviews",
    negative_ratings == 0 ~ "No Negative Reviews",
    positive_ratings & negative_ratings == 0 ~ "No Reviews",
    total_reviews < 1000 ~ "Not Enough Reviews",
    TRUE ~ "No Data"
  ),.before = positive_ratings) %>% 
  select(-percent_positive_reviews)
```



```{r}
backloggd <- read_csv("clean_data/backloggd_clean.csv")
```

```{r}
backloggd
```


```{r}
games_raw <- games_raw %>% 
  clean_names()
```

```{r}
# Platform isn't a genre - correcting
games_raw <- games_raw %>% 
  mutate(genre = case_when(
    genre == "Platform" ~ "Platformer",
    TRUE ~ genre
  ))
```

```{r}
nintendo_published <- games_raw %>% 
  filter(publisher == "Nintendo")
```

```{r}
genre_publisher_sales <- games_raw %>%  
  group_by(publisher, genre) %>% 
  summarise(sum(global_sales)) %>% 
  arrange(desc(`sum(global_sales)`))
```
```{r}
write_csv(genre_publisher_sales, "clean_data/genre_by_publisher.csv")
```

```{r}
genre_publisher_sales %>% 
  filter(publisher == "From Software")
```

```{r}
games_raw %>% 
  distinct(rating)
```


```{r}
# clarifying that the rating is for ESRP - the american system
games_raw <- games_raw %>% 
  rename("esrp" = "rating") 
```

```{r}
# Adding PEGI equivalent 
games_raw <- games_raw %>% 
  mutate(pegi_equivalent = case_when(
    esrp == "EC" ~ "3",
    esrp == "E" ~ "7",
    esrp == "E10+" ~ "12",
    esrp == "T" ~ "16",
    esrp == "M" ~ "18",
    esrp == "AO" ~ "18",
    esrp == "K-A" ~ "7",
    esrp == "RP" ~ "No Rating",
    TRUE ~ "No Rating Found"
  )) 
  
```

```{r}
total_sales_by_genre <- games_raw %>% 
  group_by(genre) %>% 
  summarise(sum(global_sales)) %>% 
  rename("total_sales" = "sum(global_sales)") %>% 
  arrange(desc(total_sales))
```

```{r}
total_sales_by_genre <- total_sales_by_genre %>% 
  drop_na()
```

```{r}
games_raw 
```

```{r}
breakdown_sales_by_genre <- games_raw %>% 
  group_by(genre) %>% 
  summarise(sum(na_sales), sum(eu_sales), sum(jp_sales), sum(other_sales)) %>% 
  rename("total_na_sales"= "sum(na_sales)", "total_eu_sales" = "sum(eu_sales)", "total_jp_sales" = "sum(jp_sales)", "total_other_sales" = "sum(other_sales)")
```

```{r}
total_sales_by_genre # this seems to have some rounding differences compared to breakdown, so i'll retire this one 
```
how does the sales for the top 3 genres look in each region?


```{r}
breakdown_sales_by_genre <- breakdown_sales_by_genre %>% 
  group_by(genre) %>% 
  mutate(total_sales = sum(total_na_sales + total_eu_sales + total_jp_sales + total_other_sales)) %>% 
  drop_na()
```
```{r}

breakdown_sales_by_genre %>% 
  filter(genre %in% c("Action", "Sports", "Shooter")) %>% 
  ggplot(aes(x = total_sales, y = genre))+
  geom_col() +
  geom_col(aes(x = total_na_sales), fill = "green")+
  geom_col(aes(x = total_eu_sales), fill = "red") +
  geom_col(aes(x = total_jp_sales), fill = "blue") 

# this looks shit
```

What about trends? How did each genres sales adjust every 5 or so years?

```{r}
games_raw %>% 
  filter(year_of_release == "2016")

games_raw %>% 
  mutate(year_of_release = as.integer(year_of_release)) %>% 
  distinct(year_of_release) %>% 
  arrange(year_of_release)

```

```{r}
sales_2019 <- read_csv("raw_data/sales-2019.csv")
```

```{r}
sales_2019 %>% 
  distinct(Year) %>% 
  arrange(desc(Year))
```
```{r}
sales_2019 %>% 
  filter(Year == "2019") %>% 
  select(Name, Year, Platform)

# data for 2020 is total nonsense. Some information in 2019 is incorrect.
```
# Going off the fact that there is the majority of usable information i am able to get is from 2019 or before, we're just going to pretend we're living in an 
# alternative 2019

As such, i am dropping all data from 2020 in this dataset 

```{r}
sales_2019 <- sales_2019 %>% 
  filter(Year != "2020")
```

# Looking at backloggd again

```{r}
backloggd
```
```{r}
# creating a range_rating column to reduce noise when it comes to plotting results
backlogged <- backloggd %>% 
  mutate(rating_range = case_when(
    rating < 5 & rating >= 4 ~ "4+",
    rating < 4 & rating >= 3 ~ "3 to 4",
    rating < 3 & rating >= 2 ~ "2 to 3",
    rating < 2 & rating >= 1 ~ "1 to 2",
    rating < 1 ~ ">1",
    TRUE ~ "No rating"
  ))
```



